Search Results: "brad"

30 March 2012

Joey Hess: podcasts that don't suck

My public radio station is engaged in a most obnoxious spring pledge drive. Good time to listen to podcasts. Here are the ones I'm currently liking.

22 January 2012

Kees Cook: fixing vulnerabilities with systemtap

Recently the upstream Linux kernel released a fix for a serious security vulnerability (CVE-2012-0056) without coordinating with Linux distributions, leaving a window of vulnerability open for end users. Luckily, there are some mitigating factors. Still, it's a cross-architecture local root escalation on most common installations. Don't stop reading just because you don't have a local user base: attackers can use this to elevate privileges from your user, or from the web server's user, etc. Since there is now a nearly-complete walk-through, the urgency for fixing this is higher.

While you're waiting for your distribution's kernel update, you can use systemtap to change your kernel's running behavior. RedHat suggested this, and here's how to do it in Debian and Ubuntu: the systemtap script changes the argument containing the size of the write to zero bytes ($count = 0), which effectively closes this vulnerability.

UPDATE: here's a systemtap script from Soren that doesn't require the full debug symbols. Sneaky, but it can be rather slow since it hooks all writes in the system. :)
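For reference, here is a minimal sketch of the kind of guru-mode one-liner described above. This is a reconstruction, not the exact script from the post; it assumes the vulnerable /proc/<pid>/mem write handler is mem_write() in fs/proc/base.c and that debug symbols matching the running kernel are installed:

# Sketch only: guru mode (-g) is needed to modify a target variable.
# Forcing the count argument to zero turns /proc/<pid>/mem writes into no-ops.
sudo stap -g -e 'probe kernel.function("mem_write@fs/proc/base.c").call { $count = 0 }'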

© 2012, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

11 December 2011

Stefano Zacchiroli: bits from the DPL for November 2011

Mako's IronBlogger is a great idea. I often find myself postponing blog posts for a very long time, simply out of laziness. IronBlogger provides a nice community incentive to counter (my) laziness and blog more often. As a related challenge, we have to face the fact that different subsets of our communities use different media to stay informed: mailing lists, blog (aggregators), social media, IRC, etc. Disparities in how they stay informed are a pity and can be countered by using multiple media at a time. Although I haven't blogged very often lately, I've managed to keep the Debian (Developer) community informed of what happens in "DPL land" on a monthly basis, by means of "bits from the DPL" mails sent to d-d-a. While the target of bits mails perfectly fits d-d-a, there is no reason to exclude a broader public from them. After all, who knows, maybe we'll find the next DPL victim^W candidate among Planet readers! Bonus point: blogging this also helped me realize that my mails are not as markdown-clean as I thought they were. I still have no IronBlogger squad, though. (And sharing beers with folks in the Boston area is not terribly handy for me.) Anyone interested in setting up a BloggeurDeFer in the Paris area? (SCNR)
Dear Project Members,
another month has passed, so it's time to bother you again about what has happened in DPL land in November (this time, with even less delay than the last one, ah!). Call for help: press/publicity team. I'd like to highlight the call for help by the press / publicity teams. They are "hiring" and sent out a call for new members a couple of weeks ago. The work they do is amazing and very important for Debian, as important as maintaining packages or fixing RC bugs during a freeze. It is only by letting the world know what Debian is and what we do that we can keep the Project thriving. And letting the world know is exactly what the publicity and press teams do. If you're into writing, blogging, or simply have a crush on social media, please read the call and "apply"! Interviews: November has apparently been the "let's interview the DPL" month. I've spent quite some time giving interviews to interested journalists about various topics. For both my embarrassment and transparency on what I've said on behalf of Debian, here are the relevant links. Other topics this month: assets, legal advice (work in progress), relationships with others, miscellanea. Thanks for reading thus far,
and happy hacking.
PS as usual, the boring day-to-day activity log is available at master:/srv/leader/news/bits-from-the-DPL.*

21 November 2011

Alastair McKinstry: From a comment by Brad K at Causon's Book:

From a comment by Brad K at Causon's Book: It is a couple of days old now, so I imagine you have seen this. One hedge broker for grain and cattle, Barnhardt Capital Management (http://barnhardt.biz/), has called it quits. Most agribusinesses deal in futures and options on their crops to make a good part of their annual income. According to Ann: "The reason for my decision to pull the plug was excruciatingly simple: I could no longer tell my clients that their monies and positions were safe in the futures and options markets because they are not. And this goes not just for my clients, but for every futures and options account in the United States. The entire system has been utterly destroyed by the MF Global collapse." Interesting times. As the global financial system (not just banks) is wobbling, we need to ensure our food supplies. Time to plant veg and stock up on tins?

19 October 2011

Yves-Alexis Perez: Debian grsec kernels

I recently received a mail about my attempt to provide grsecurity kernels in Debian. The sender found the bug by accident and asked me why I didn't do some more publicity here. So here we are. I won't go into details on what grsecurity is; it's fairly complex, but it's basically a hardening patch for the Linux kernel, with three main components. A lot of this touches low-level stuff in the kernel, especially memory management. Ideally this patch would be pushed upstream, but Brad Spengler (grsecurity's main developer) has already said he isn't interested in upstreaming it, and upstream has already said the patch is too huge and invasive to include as-is (especially since the original authors aren't interested in maintaining it upstream). There's an ongoing effort to split the patch and merge things little by little, but in the meanwhile a mid-term solution would be nice.

I know Debian users who rebuild grsecurity-patched kernels themselves, and I know some of them would appreciate having them included in the Debian kernel. Fortunately, the linux-2.6 source package has a nice feature called a featureset: basically, a way to build some (binary) packages using a different set of patches and a different config. For example, this was used to provide the xen/openvz/vserver patchsets, and is now used to provide the rt kernels. So I thought it'd be nice to provide a grsec featureset, and started doing the work. I have a working setup for producing those kernels, so I've opened a wishlist bug against the kernel (#605090) to have this merged. Those packages follow the sid kernel.

There's ongoing work for Squeeze, but it's a bit harder there because both the grsecurity patchset and the Debian kernel ship a whole lot of backports to the Linux kernel, meaning the grsecurity patch doesn't apply directly to the Debian source package. Basically I need to remove some of the hunks (since they are already applied to the source) and port some others (since there is some backported code not present in vanilla 2.6.32, for example the drm code). Until the patches are merged and the bug is closed, I host some of the built packages at:

deb http://molly.corsac.net/~corsac/debian/kernel-grsec/packages/ sid/

The repository is signed by my key, which you can add to your apt setup using apt-key add. If you want to rebuild the packages yourself, here's the method:

mkdir kernel-grsec
cd kernel-grsec
svn checkout svn://svn.debian.org/svn/kernel/dists/sid/linux-2.6
git clone git://anonscm.debian.org/users/corsac/grsec-patches.git
wget http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.0.tar.bz2
wget http://www.kernel.org/pub/linux/kernel/v3.0/linux-3.0.tar.bz2.sign
gpg --verify linux-3.0.tar.bz2.sign linux-3.0.tar.bz2
cd linux-2.6
apt-get build-dep linux-2.6
export QUILT_PATCHES=../grsec-patches
quilt push -a
python debian/bin/genorig.py ../linux-3.0.tar.bz2
debian/rules orig
fakeroot debian/rules source
fakeroot make -f debian/rules.gen binary-arch_amd64_grsec_amd64

You could also use dpkg-buildpackage, pdebuild or whatever. The kernel handbook is nice reading too if you want more information on how to rebuild Debian kernels. The quilt push -a may fail if you check out an svn version more recent than mine; I try to keep the patches up to date, but I usually have some delay. Note that installing the kernel will require the linux-grsec-base package. The binary is not yet available on my mirror, but you can easily build it; the source can be found on git.debian.org. If you're interested in this, don't hesitate to mail me or the bug.

26 September 2011

Gunnar Wolf: e-voting: Something is brewing in Jalisco...

There's something brewing, moving in Jalisco (a state in Mexico's west, where our second-largest city, Guadalajara, is located). And it seems we have an opportunity to participate, hopefully to be taken into account in the future. Ten days ago, I was contacted by phone by the staff of UDG Noticias, for an interview on the Universidad de Guadalajara radio station. The topic? Electronic voting. If you are interested in what I said there, you can get the interview from my webpage. I kept up some e-mail contact with the interviewer, and during the past few days he sent me some links to articles in the La Jornada de Jalisco newspaper and asked for my opinion on them. On September 23, a fellow UNAM researcher, César Astudillo, claimed the experience in three municipalities in Jalisco proves that e-voting is viable in the state, and today (September 26), another article claims the third generation of an electronic ballot box is apparently invulnerable. Of course, I don't agree with the arguments presented (and I'll reproduce the mails I sent to UDG Noticias about it before my second interview just below; they were originally in Spanish). However, what I liked here is that it does feel like a dialogue. Their successive texts seem to answer my questioning. So, even though I cannot yet claim this is a real dialogue (it would be much better to sit down face to face and have a fluid conversation), it feels very nice to actually be listened to from the other side! My answer to the first note:
The topic of electronic ballot boxes keeps making news here in Jalisco... At Medios UDG we have presented different voices: Dr. Gabriel Corona Armenta, who is in favor of electronic voting; Dr. Luis Antonio Sobrado, presiding magistrate of Costa Rica's supreme electoral tribunal, who told us about the US$20 million it would cost them to implement the system, which is why they have not done so thus far; we even spoke to Federico Heinz in Argentina about his categorical opposition to electronic voting; and of course there is the interview we did with you. However, today La Jornada Jalisco published the following article http://www.lajornadajalisco.com.mx/2011/09/23/index.php?section=politica... and we would like to know your point of view on it. I await your reply.
Hello. Well... I know the IFE did a very interesting, well-executed development a couple of years ago, designing from scratch the ballot boxes they proposed to use, but they were never deployed beyond pilots (for cost reasons, as far as I understand). I find it sad and dangerous that, with that precedent, Jalisco's IEPC is proposing to buy prefabricated technology and to trust whatever a vendor offers them. What the headline proposes strikes me as downright naive: "elections in three municipalities prove the viability of electronic voting across the whole state". Let's put it in these terms: does the fact that a sheet-metal shack with a wooden frame has not collapsed prove that we can build skyscrapers out of sheet metal and wooden frames? Now, a couple of paragraphs from this La Jornada article catch my attention:
the proposal the Electoral and Citizen Participation Institute (IEPC) wishes to carry out, of running the election across the whole state with electronic ballot boxes, is viable, since the elections held in three municipalities are sufficient proof that the ballot box is reliable
and a few paragraphs further on,
How many more trials are needed to know whether it is trustworthy? 20, 30? I don't know (...) But once there is a real, effective and serious diagnosis of when it is technically appropriate, the decision can be made
As I mention in my article... we cannot confuse absence of evidence with evidence of absence. That is, the fact that a small deployment showed no irregularities does not mean there cannot be any. The fact that some countries run 100% on electronic ballot boxes does not mean it is the way to go. There are some, and not a few, experiences of electronic ballot boxes failing in various ways, and that shows the implementations cannot be trusted. Even if the equipment were free (which it is not), resources must be invested in guarding and maintaining it. Even if a voter-verified paper trail were produced (which has only been the case in a small fraction of voting stations), nothing guarantees that the results the equipment reports are always consistent with reality. The potential for misuse they offer is too great. Regards,
And my answer to the September 26 note:
Sorry to bother you again, but today yet another article was published on the topic of electronic ballot boxes in Jalisco, claiming that the ballot box is invulnerable. http://www.lajornadajalisco.com.mx/2011/09/26/index.php?section=politica... Could you give us a few minutes to talk by phone, like last time, about the specific case of Jalisco, with reference to these recently published articles? If possible, could I call you today at 2 pm? I await your reply and thank you for your help; we greatly appreciate this collaboration you are doing with us.
Hello, (...) Regarding this article: once again, absence of evidence is not evidence of absence. A small group of people was allowed to play with a machine. Does that mean it was a complete, exhaustive test? No, only that casual fiddling did not turn up obvious, serious flaws. A real confidence-building process would consist of inviting the computer security expert community (as they did in Brazil - and the machines turned out to be vulnerable) to perform whatever tests they deem necessary, with a reasonable level of access to the equipment. Besides, security goes beyond modifying the stored results. A couple of examples that come to mind without thinking too hard:
  • What happens if I stick chewing gum into the magnetic card reader slot?
  • What happens if I hit one of the keys hard enough to make it a little less sensitive, without destroying it completely? (Or, while we're at it, if I do destroy it?)
Denial of service is another kind of attack we have to be familiar with. Not only is it possible to alter the outcome of the vote; it is also very easy to prevent people from exercising their right. What would they do in that case? Well, they could fall back to voting on paper - on sheets from a notepad, probably signed by each of the polling officials, for example. But if an attacker has blocked the reading of the magnetic card, which the polling station president needs in order to mark the box as closed, the voters have been robbed of their vote. Yes, the printed votes are there (and, frankly, I am very glad to see this ballot box handles them this way). Counting is possible, although a bit more awkward than in a traditional vote (because you have to check which ballots are marked as invalidated; it is not very clear to me what the scenario looks like for a voter who chose one option, had another printed, and had the result corrected and marked as such)... but it is possible. However, and to close this answer: if we do a test run under controlled circumstances, we will obviously not notice the very many failures an electronic ballot box can introduce when the "bad guys" are its programmers. Can we be sure that this Atlas-Chivas-Cruz Azul scoreboard has the same reliability as an election between real candidates, one of whom may have paid the development company to manipulate the election? And even if the process were perfect, they indicate here that they are _trying_ to put these ballot boxes out to tender (and again, if what this article says is true, they are among the best ballot boxes available and have addressed many of the criticisms - good!)... What for? What will these ballot boxes give us; what does society stand to gain? More speed? Negligible: half an hour's gain. In exchange for how much money? More trustworthiness? Clearly not, given that it is not just a few of us cranks who question the system; its very proponents point to widespread doubt. The sentence that closes the article deserves to hang as an epilogue: "in that perhaps not-so-distant future, corruption still happens, and it is always due to the human factor". And the human factor is still there. Electronic ballot boxes are programmed by people, by fallible people. Whichever side you are on, you will remember the controversy when it became public that the aggregation of votes in 2006 was supervised by the Hildebrando company, owned by the brother-in-law of then presidential candidate Felipe Calderón. What prevents us from falling into a similar, but widely distributed, scenario? And here we must refer to the ruling of Germany's Federal Constitutional Court: in that country, electronic voting was declared unconstitutional because only a group of specialists could audit it. A box full of papers with clear evidence of each participant's vote can be understood by any citizen; the code that controls electronic ballot boxes, only by a small percentage of the population.

1 September 2011

Matthew Garrett: The Android/GPL situation

There was another upsurge in discussion of Android GPL issues last month, triggered by a couple of posts by Edward Naughton, followed by another by Florian Mueller. The central thrust is that section 4 of GPLv2 terminates your license on violation, and you need the copyright holders to grant you a new one. If they don't, then you don't get to distribute any more copies of the code, even if you've now come into compliance. TL;DR: most Android vendors are no longer permitted to distribute Linux.

I'll get to that shortly. There's a few other issues that could do with some clarification. The first is Naughton's insinuation that Google are violating the GPL due to Honeycomb being closed or their "license washing" of some headers. There's no evidence whatsoever that Google have failed to fulfil their GPL obligations in terms of providing source to anyone who received GPL-covered binaries from them. If anyone has some, please do get in touch. Some vendors do appear to be unwilling to hand over code for GPLed bits of Honeycomb. That's an issue with the vendors, not Google.

His second point is more interesting, but the summary is "Google took some GPLed header files and relicensed them under Apache 2.0, and they've taken some other people's GPLv2 code and put it under Apache 2.0 as well". As far as the headers go, there's probably not much to see here. The intent was to produce a set of headers for the C library by taking the kernel headers and removing the kernel-only components. The majority of what's left is just structure definitions and function prototypes, and is almost certainly not copyrightable. And remember that these are the headers that are distributed with the kernel and intended for consumption by userspace. If any of the remaining macros or inline functions are genuinely covered by the GPLv2, any userspace application including them would end up a derived work. This is clearly not the intention of the authors of the code. The risk to Google here is indistinguishable from zero.

How about the repurposing of other code? Naughton's most explicit description is:

For example, Android uses bootcharting logic, which uses the 'bootchartd' script provided by www.bootchart.org, but a C re-implementation that is directly compiled into our init program. The license that appears at www.bootchart.org is the GPLv2, not the Apache 2.0 license that Google claims for its implementation.

, but there's no indication that Google's reimplementation is a derived work of the GPLv2 original.

In summary: No sign that Google's violating the GPL.

Florian's post appears to be pretty much factually correct, other than this bit discussing the SFLC/Best Buy case:

I personally believe that intellectual property rights should usually be enforced against infringing publishers/manufacturers rather than mere resellers, but that's a separate issue.

The case in question was filed against Best Buy because Best Buy were manufacturing infringing devices. It was a set of own-brand Blu-ray players that incorporated Busybox. Best Buy were not a mere reseller.

Anyway. Back to the original point. Nobody appears to disagree that section 4 of the GPLv2 means that violating the license results in total termination of the license. The disagreement is over what happens next. Armijn Hemel, who has done various work on helping companies get back into compliance, believes that simply downloading a new copy of the code will result in a new license being granted, and that he's received legal advice that supports that. Bradley Kuhn disagrees. And the FSF seem to be on his side.

The relevant language in v2 is:

You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License.

The relevant language in v3 is:

You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License

which is awfully similar. However, v3 follows that up with:

However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation.

In other words, with v3 you get your license back providing you're in compliance. This doesn't mesh too well with the assumption that you can get a new license by downloading a new copy of the software. It seems pretty clear that the intent of GPLv2 was that the license termination was final and required explicit reinstatement.

So whose interpretation is correct? At this point we really don't know - the only people who've tried to use this aspect of the GPL are the SFLC, and as part of their settlements they've always reinstated permission to distribute Busybox. There's no clear legal precedent. Which makes things a little awkward.

It's not possible to absolutely say that many Android distributors no longer have the right to distribute Linux. But nor is it possible to absolutely say that they haven't lost that right. Any sufficiently motivated kernel copyright holder probably could engage in a pretty effective shakedown racket against Android vendors. Whether they will do so remains to be seen, but honestly, if I were an Android vendor, I'd be worried. There's plenty of people out there who hold copyright over significant parts of the kernel. Would you really bet on all of them being individuals of extreme virtue?


18 August 2011

Asheesh Laroia: OpenHatch round two: the non-profit

For the past year or two, readers of asheesh.org (including those on Planet Debian) have been hearing on and off about OpenHatch, a project that began in Atlanta two summers ago. The OpenHatch website has been a place to find out how new contributors can get involved in free software. Lately, I've discovered how much fun it is to help people get involved. I've also discovered oodles of enthusiasm for learning more about joining an open source project. So I've been transitioning OpenHatch to be more of a non-profit and to work on more of those outreach events, and particularly I've been transitioning my life to support me (self-funded) working on that effort full-time for a year. If there's one thing I learned while creating a startup under incubation, it was how to save money. The OpenHatch blog has the rest of the story. Here's a taste:
I'm writing to announce three big changes for the project. First, OpenHatch is changing its organizational structure to reflect our not-for-profit goals. Second, we'll emphasize our new work beyond the website, building and promoting outreach events that bring new people into the free software community. Finally, I am taking a year to do that full-time as the project lead of OpenHatch in Somerville, MA.
This change has been a long time coming, and it's thanks to so many people who have given advice and feedback along the way. One special shout-out goes to Bradley Kuhn, who told me in March 2010 that OpenHatch should be a non-profit. I hope you'll read more.

13 July 2011

Steve McIntyre: Project Harmony?

So, the "Harmony Project" launched their set of contributor agreements and tools last week. Colour me unimpressed... There's a claim on their website that they are a "community-centered group", but I don't see any list of people and organisations who contributed to this work. That bothers me. Regarding their aim to "assist organisations which use contribution agreements", I don't think that there is anything of value here for the Free Software community at all. Free Software developers don't need contribution agreements, and in my opinion encouraging their use like this is only going to cause further splintering of the community. We've managed for a very long time without them, why start now? As a developer, I personally don't believe in contribution agreements at all. If I contribute code to a project, it will be under the terms of a good Free Software license or not at all. That's all that's needed. There's a fair body of opinion out there on this - see pieces from Bradley Kuhn, Richard Fontana and Dave Neary for more discussion. What do you think?

5 July 2011

John Goerzen: The Lives of Others

It's not very often that I watch a movie anymore. It's been a few years since I've actually purchased one (normally I see them from Netflix). But yesterday I saw one that may change that. The Lives of Others is an incredible film set in the former East Germany (GDR/DDR), mostly in 1984. The authenticity of it is incredible, and so is the story. It's subtitled, but if you're an American wary of subtitled European films, don't be wary of this one. It is easy to watch and worth every minute.

The story revolves around the Stasi, the GDR Ministry for State Security ("secret police"). It is an incredible picture of what living in a police state was like, and of how many of the informants were victims of the regime too. My breath caught near the beginning of the film, showing the inside of a Stasi building. A prisoner was being interrogated for helping someone attempt to escape to the West. But the reason my breath caught was this incredible feeling of "I was there". Last year, Terah and I were in Leipzig and visited the Stasi museum there, the "Museum in der Runden Ecke". I always have an incredible sense of history when being in a preserved place, and this building was literally the Stasi headquarters for Leipzig. Much of it was preserved intact, and seeing it in the film brought home even more vividly the terrible things that happened in that building, and others like it, not so very long ago.

We watched the special features on the Blu-ray disc, and one of them was an interview with director Florian Henckel von Donnersmarck. He described how he spent a lot of time interviewing both victims of the Stasi and ex-Stasi officers. One of the most disturbing things to me was his almost offhand comment that most of the former Stasi officers still had some pride in performing their jobs well. Even now, freed of the state's ideology, they were proud of the work they did, which could be put most charitably as ruining people's lives. What leads a person to view life that way? How can we try to make sure it doesn't happen again elsewhere?

I am happy to say that most of us have never experienced anything like the Stasi. And yet, small reflections of that mindset can be seen almost everywhere. Societies at wartime or feeling under threat, even Western democracies, can drum up those feelings. In the USA, for instance, the McCarthyism era saw people's careers ruined for alleged anti-state behavior. Contemporary examples include the indefinite detention (I hate that word; shouldn't we say "imprisonment"?) of terrorism suspects at Guantanamo Bay, and the terrible treatment of Bradley Manning, who revealed some true but embarrassing things about the US military which really needed to be revealed. Even tobacco farmers and companies are selling a product they know ruins lives, but somehow keep doing it. And there are still members of the public that try to make life difficult for people that don't think like they do. From organizing campaigns of telephone harassment of colleges that don't perform the American national anthem before sporting events, to tossing about the term "un-American" (a loaded McCarthyist one, of which many may not even be aware) at an inflated rate, we are not immune from attempts at forcing conformity or silence in others, and blind loyalty to the state.

I am never in a particularly celebratory mood on July 4, the biggest day for American boasting, faux patriotism, militarism, and general flag-waving. We do have a lot to be proud of and thankful for, but it seems that we celebrate all the wrong things on July 4, and see it as an occasion to proclaim American exceptionalism rather than as one to see how far we've come and to bolster hope for how far we can, and should, yet go. No, I don't think that "the land of the free" ought to have operated secret prisons in Europe (nor the Europeans to have been complicit in it), or that the American military was defending our freedom 100% of the time they were deployed, or that it is right for governments to mandate daily recitation of an untrue document (the pledge of allegiance) in schools. And yet, I am mindful that I have a lot to be thankful for: stability, lack of much internal violent conflict, etc. And this particular day I am happy that a post like this is not something that gets the attention of some government agency, and that, at most, I will have a handful of angry emails to delete.

15 May 2011

Christian Perrier: 2011 week 19 Debian work

That was a damn busy week, mostly centered on attending SambaXP, the annual Samba users and developers conference, in Göttingen, Germany: the only free software conference I attend with expenses paid by my employer, Onera. This year was the 10th edition and, as last year, I won the "who was there for the nth edition" game by being the only one left standing when they asked who had attended every edition of the conference. :-)

That was a great week, with time spent with people as interesting as Andrew Tridgell, Jeremy Allison, John Terpstra, Volker Lendecke, Kai Blin, to name a few. A good opportunity, again, to get input for our packaging work on that big piece of software, as well as getting visibility into the future of Samba. I also had a great, even if short, talk with the kind Karolin Seeger, the release manager of Samba for 3 years now. We talked about... children, as she's been a mother since last year (with a non-negligible impact on her professional life, as often in Germany). A great meeting, too, with Brad Kuhn from the Software Freedom Conservancy, who gave a keynote about GPL licence enforcement activities.

It is becoming more and more certain that Samba3 and Samba4 will reconverge after the Samba Team releases Samba 3.6. That brings plans for our packaging work: I think we'll stick with having samba 3.6 in wheezy while the brand new shiny Samba4 probably stays separate in some way. Our users (and /me first) clearly need stability in the file and print services first. Of course, I did some packaging work there: samba 3.6.0pre3 was uploaded to experimental, about 10 days after its official announcement. I also worked on the samba *binary* package bugs, triaging them as usual. We now have 51 bugs open against the samba binary package: 18 unclassified, 11 moreinfo (several likely to be closed as unreproducible or user error), 1 wontfix, 8 with a pending patch and 13 forwarded upstream. I'm also thinking about a possible way to ask about SMB2 support in samba: it won't be activated by default in 3.6 (mostly because we, distros, requested that, and by "we" I mean Debian, RHEL, SuSE and their derivatives, so quite a large consensus). Still, it would be good to put some light on SMB2 support, and a debconf question about it could be a solution (not shown by default and defaulting to no SMB2).

I also worked quite extensively on packages maintained by Noël Köthe, Ralf Treinen and me, aka "the pkg-running team". I set up a git repo for my new "garmin-ant-downloader" package, which allows downloading track files from Garmin Forerunner 405 GPS watches (guess what the brand and model of mine is!). My first packaging git repository! Thanks to Ralf for his advice and help with this. I triaged bugs in the other two packages we maintain: pytrainer (more bugs forwarded upstream) and garmin-forerunner-tools (which was later uploaded by Ralf). I also set up a team mailing list, so now we're a real team... :-)

Little activity on the l10n front: a few Smith reviews are in progress, and I completed 1 or 2 French translations and reviewed some others. Regular activity, then. The only specific item is that I'm now pushing harder on the French DDTP effort, doing many reviews and translations there. We're trying to reach 100% in the "popcon500" packages. Later, we'll try to reach the heights attained by the Italian and German teams, who are, on this l10n activity, way ahead of us.

Finally, during the SambaXP conference, and as usual (except last year, because of too-heavy work duties), I visited my German friends living "close to" Göttingen, accompanied by Luk Claes and his friend and colleague Ivo, who were also attending SambaXP. A great barbecue at Andreas and Kathrin Tille's place, facing the Wernigerode castle at sunset. And the best spätzle ever at Meike and Alex Reichle's place in Hildesheim, with a French touch on the salad dressing as well as great Chilean wine brought by Meike's coworker Wolfram. Always a great time to see these good friends, even if that means driving a few hours (and being flashed... twice!... by German speed cameras on the way to and back from Andreas' place!)

To complete the week, I ran a 34km/800m+ trail today in the Rambouillet forest, completing it in 3h31. I'll probably blog separately about running updates, as it's now been quite some time since I did. Guess what? I'll be sleeping well tonight...

29 April 2011

Rob Bradford: Nicely formatting number types

I came across the following compiler warning today:
    CC     libplurk_la-plurk-item-view.lo
  plurk-item-view.c: In function 'construct_image_url':
  plurk-item-view.c:233: warning: format '%lld' expects type 'long long int',
  but argument 3 has type 'gint64'
Tut tut tut .. a compiler warning! How could the committer have let this happen? Let's look at the code:
url = g_strdup_printf ("http://avatars.plurk.com/%s-medium%lld.gif", ...);
Ahaha .. they're probably using a 32-bit system; the nice thing to do here is:
url = g_strdup_printf ("http://avatars.plurk.com/%s-medium%" G_GINT64_FORMAT ".gif", ...);
This handles the fact that on a 64-bit system a gint64 can be represented as a long rather than needing to go for a long long. I've seen this quite a bit with debugging output for size_t, for which G_GSIZE_FORMAT is definitely your friend.
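(Not from the original post.) A minimal self-contained sketch of the same idea, assuming only GLib: the format macros expand to the right conversion specifier for the current platform, so this compiles warning-free on both 32-bit and 64-bit systems:

#include <glib.h>

int main (void)
{
  /* gsize and gint64 need different printf specifiers on different platforms;
     the G_*_FORMAT macros pick the right one at compile time. */
  gsize len = sizeof (gint64);
  gint64 stamp = G_GINT64_CONSTANT (1234567890123);

  g_print ("len=%" G_GSIZE_FORMAT " stamp=%" G_GINT64_FORMAT "\n", len, stamp);
  return 0;
}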

4 April 2011

Rob Bradford: London GNOME Beer 3.0

In/near London on April the 8th? Like beer? Like pizza? Love GNOME? Then you probably want this wiki page.

8 March 2011

Rob Bradford: Autofoo: AC_ARG_ENABLE

Spent some of my morning fixing a build issue that came down to the use of AC_ARG_ENABLE. I thought it was worth recording some notes on its use. Something enabled by default:
AC_ARG_ENABLE([badgers],
              [AC_HELP_STRING([--enable-badgers=@<:@yes/no@:>@],
                              [Enable badgers @<:@default=yes@:>@])],
              [],
              [enable_badgers=yes])
Something disabled by default:
AC_ARG_ENABLE([badgers],
              [AC_HELP_STRING([--enable-badgers=@<:@yes/no@:>@],
                              [Enable badgers @<:@default=no@:>@])],
              [],
              [enable_badgers=no])
Observe that in this case the only thing that changes is the default value in the fourth parameter to AC_ARG_ENABLE (and also the documentation :-) ). Something enabled by default but with a disable syntax:
AC_ARG_ENABLE([badgers],
              [AC_HELP_STRING([--disable-badgers],
                              [Disable badgers])],
              [],
              [enable_badgers=yes])
Notice that this case is just the same as the first except with the help text changed. Of course you actually need to use the value from the flag. I think this is more readable if presented outside the AC_ARG_ENABLE parameters; this is possible because AC_ARG_ENABLE always sets a variable called enable_<thing>. Awesome, huh? For conditional pkg-config:
AS_IF([test "x$enable_badgers" = "xyes"],
      [PKG_CHECK_MODULES(BADGERS, [badgers-1.0])],
      [])
For conditional building:
AM_CONDITIONAL(ENABLE_BADGERS, test "x$enable_badgers" = "xyes")
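And to actually use the conditional in Makefile.am, something like this (a sketch; the program name is made up, not from the post):

if ENABLE_BADGERS
# Only built when --enable-badgers was given (hypothetical target)
bin_PROGRAMS = badger-tool
badger_tool_SOURCES = badger-tool.c
endif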

6 March 2011

Lars Wirzenius: Keep companies honest

Bradley Kuhn writes about the phrase "open core". In summary: avoid that phrase and use clearer, unambiguous terms instead. He also touches upon the practice of projects requiring copyright assignment before accepting contributions.

Like Bradley, I dislike the practice. If a corporation requires me to assign copyright to them, so that they have the power to change the license, I need to be paid in a currency that helps me pay rent. The only reason they need the power to change the license is to make the code proprietary, for the purpose of extorting users for money. If they get money, I want money. On the other hand, if I don't assign them copyright, I have a bit of power myself, and I can use that to keep them honest. If enough people do that, the corporation would have to rewrite all the code to make sure they have copyright on all of it and can license it in a proprietary way.

There's another kind of copyright assignment, which the FSF uses: the FSF gets copyright, but the contract with them guarantees that they'll not violate the freedom of the code. In return, the contributor is relieved of the burden of having to deal with license violators, and can rely on the FSF to do that instead. This is an acceptable bargain to me, even though I haven't contributed enough to GNU projects to do a copyright assignment. There may be other entities than the FSF doing the same thing; I have not seen any others, though. So on the whole I consider copyright assignment a way to trick me: with the exception of the FSF, copyright assignment smells dishonest to me.

27 January 2011

Rob Bradford: Work with us!

Are you graduating this year? Or recently graduated? Do you want to work in Open Source? Do you have the right to work in the UK? Do you want to work with some of the best minds in the field: Chris Lord, of Happy Wombats fame; Damien Lespiau, the Clutter GST mastermind; Emmanuele Bassi, our Clutter super-hero; Jussi Kukkonen, who puts the clue in Geoclue; Ross Burton, our EDS magician; Srini Ragavan, Evolution shiny-thing maker; Thomas Wood, of the MX and control-center massive; Tomas Frydrych, our Antarctic naming scheme generator; and of course pippin. Interested? Take a look at our job entry. I should be around at FOSDEM, so feel free to corner me to talk.

19 January 2011

Evgeni Golov: funny spam

Yes, I do collect funny spam ;) Today I will present some funny spammy comments I got on my (wordpress-powered) blog over the last months.
  1. 2010/09/24 at 03:08
    I cant believe, Facebook is currently down with a DNS failure. I guess Facebook having some issues. Businesses are reporting a near impossible 480% increase in productivity
  2. 2010/10/25 at 11:07
    Not quite on topic BUT really important: Please guys, donate something for Haiti! I just came back from a trip down there and I have to say the situation is really terrible! It's soon christmas time, so please be so kind and do something good! Thanks
  3. 2010/10/27 at 11:07
    Hey, I can't view your site properly within Opera, I actually hope you look into fixing this.
  4. 2010/11/18 at 12:26
    Are you watch Sarah Palin's TV show? I saw the trailer & wtf? -__- She's like, This is so much better than being in politics. It's like she's doing this just because she is losser. o.O What do you think? Do you believe she can be the next american presiden? ..
  5. 2010/11/30 at 15:40
    A turning point in the trial against Jörg Kachelmann. The crowd favourite's lawyers have resigned their mandate. The trial in Mannheim is nevertheless to continue as scheduled. There will be no motion to suspend the proceedings, attorney Ralf Hoecker added. The new defence lawyer Johann Schwenn (Hamburg) and the court-appointed defender Andrea Combé, present since the start of proceedings, will be extremely well prepared, he went on to say.
  6. 2010/12/09 at 17:31
    How I can download documents from WikiLeaks?
    Thanks
  7. 2010/12/10 at 10:09
    Hi, you should check out http://www.voteonwikileaks.com. It s sort of like a crowdsourced collection of arguments against Wikileaks. Considering you re a blogger, i think you d find it to be an interesting read
  8. 2010/12/16 at 18:38
    It appears as though Julian Assange will at least be out on bail any minute but what about Bradley Manning? Solitary confinement for seven months so far without being convicted of anything, without a trial, even. That's bad!
  9. 2010/12/24 at 09:03
    Is there a Jailbreakme for Nerdomaten? There should be enough bugs in IE6 :D
  10. 2010/12/29 at 13:22
    The really crazy thing about those machines, though, is that they don't run Win7 but: XP!!!!!!! Isn't that something? I even have video proof of the thing booting up
  11. 2010/12/30 at 05:03
    Widespread criticism of RealVNC vulnerability is fixed
    A security report claimed that RealVNC software virtual network a high-risk vulnerability could allow a malicious attacker does not need a password to login to a remote system
  12. 2011/01/05 at 18:10
    is it true that it is not possible to use that sky thing with verizon?
Sorry, some were originally in German ;) What's so funny about them, you ask? Well, they do contain actual content (even if not always matching the posts they were attached to), they look relevant to the time they were posted, and they are still spam: the links in the homepage field of the comments led to spam sites. Numbers 9 and 10 are even funnier: they are exact copies of comments already present on the same post. And oh, I think WikiLeaks should leak a cable about downloading stuff from WikiLeaks ;)

19 November 2010

Ondřej Čertík: Google Code vs GitHub for hosting opensource projects

Cython is now considering options where to move the main (mercurial) repository, and Robert Bradshaw (one of the main Cython developers) has asked me about my experience with regards to Google Code and GitHub, since we use both with SymPy.

Google Code is older, and it was the first service that provided free hosting for a virtually unlimited number of projects that you could easily and immediately set up. At that time (4 years ago?) that was something unheard of. However, the GitHub guys have in the meantime not only made this available too, but also implemented features that (as far as I know) no one else offers at all: hosting your own pages at your own domain (but on GitHub's servers; some examples are sympy.org and docs.sympy.org), commenting on git branches and pull requests before the code gets merged in (I am 100% convinced that this is the right approach, as opposed to commenting on the code after it gets in), and making it easy to fork a repository. It simply has more social features than Google Code does.

I believe that managing an opensource project is mainly a social activity, and GitHub's social features really make so many things easier. From this point of view, GitHub is clearly the best choice today.

I think there is only one (but potentially big) problem with GitHub: its issue tracker is very bad compared to the Google Code one. For that reason (and also because we already use it), we keep SymPy's issues at Google Code.

The above are the main things to consider. Now there are some little things to keep in mind, which I will briefly touch on below. Google Code doesn't support git, and blocks access from Cuba and other countries. When you want to change the front page, you need to be an admin, while at GitHub I simply add push access for all sympy developers, so anyone just pushes a patch to this repository: https://github.com/sympy/sympy.github.com, and it automatically appears on our front page (sympy.org). With Google Code we had to write long pages (in our docs) about how to send patches; with GitHub we just say "send us a pull request" and point to http://help.github.com/pull-requests/. In other words, GitHub takes care of teaching people how to use git and figure out how to send patches, and we can concentrate on reviewing the patches and pushing them in.

Wiki pages at GitHub are maintained in git, and they provide the web frontend to them as open source, so there is no vendor lock-in. Anyone with a GitHub account can modify our wiki pages, while the Google Code pages can only be modified by people I add to the Google Code project. That forced us to install MediaWiki on my Linode server (hosted at linode.com, which by the way is an excellent VPS hosting service that I have been using for a couple of years already and can fully recommend), and I had to manage it all the time; now we are moving our pages to the GitHub wiki, so I have one less thing to worry about.

So as you can see, I, as admin, have fewer things to worry about, as GitHub manages everything for me now, while with Google Code I had to manage lots of things on my Linodes.

One other thing to consider: GitHub is primarily for git, but they also provide svn and hg access (both push and pull; they translate the repository automatically between git and svn/hg). I never really used it much, so I don't know how stable it is. As I wrote before, I think that git is the best tool now for maintaining a project, and I think that GitHub is now the best choice for hosting it (except the issue tracker, where Google Code is better).

19 October 2010

Kees Cook: CVE-2010-2963 v4l compat exploit

If you're running a 64-bit system, and you've got users with access to a video device (/dev/video*), then be sure you update your kernels for CVE-2010-2963. I've been slowly making my way through auditing the many uses in the Linux kernel of the copy_from_user() function, and ran into this vulnerability. Here's the kernel code from drivers/media/video/v4l2-compat-ioctl32.c:
static int get_microcode32(struct video_code *kp, struct video_code32 __user *up)
{
        if (!access_ok(VERIFY_READ, up, sizeof(struct video_code32)) ||
                copy_from_user(kp->loadwhat, up->loadwhat, sizeof(up->loadwhat)) ||
                get_user(kp->datasize, &up->datasize) ||
                copy_from_user(kp->data, up->data, up->datasize))
                        return -EFAULT;
        return 0;
}
Note that kp->data is being used as the target for up->data in the final copy_from_user() without actually verifying that kp->data points anywhere safe.
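(A hedged sketch, not the actual upstream patch.) The shape of the missing validation is worth spelling out: never copy a user-controlled length to a kernel pointer picked up from uninitialized stack. Bound the length and copy into memory the kernel allocated itself, along these lines (MAX_MICROCODE_SIZE is a made-up cap):

        if (get_user(kp->datasize, &up->datasize))
                return -EFAULT;
        if (kp->datasize > MAX_MICROCODE_SIZE)  /* hypothetical bound */
                return -EINVAL;
        kp->data = kmalloc(kp->datasize, GFP_KERNEL);  /* kernel-owned target */
        if (!kp->data)
                return -ENOMEM;
        if (copy_from_user(kp->data, up->data, kp->datasize)) {
                kfree(kp->data);
                return -EFAULT;
        }

Back to the vulnerable path; here's the caller of get_microcode32: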
static long do_video_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
        union {
                struct video_tuner vt;
                struct video_code vc;
...
        } karg;
        void __user *up = compat_ptr(arg);
...
        switch (cmd) {
...
        case VIDIOCSMICROCODE:
                err = get_microcode32(&karg.vc, up);
...
So, the contents of up are totally under the control of the caller, and the contents of karg (in our case, the video_code structure) are not initialized at all. So, it seems like a call for VIDIOCSMICROCODE would write video_code->datasize bytes from video_code->data into some random kernel address, just causing an Oops, since we don't control what is on the kernel's stack. But wait, who says we can't control the contents of the kernel's stack? In fact, this compat function makes it extremely easy. Let's look back at the union. Notice the struct video_tuner? That gets populated from the caller's up memory via this case of the switch (cmd) statement:
...
        case VIDIOCSTUNER:
        case VIDIOCGTUNER:
                err = get_video_tuner32(&karg.vt, up);
...
So, to control the kernel stack, we just need to call this ioctl twice in a row: once to populate the stack via VIDIOCSTUNER with the contents we want (including the future address for video_code->data, which starts at the same location as video_tuner->name[20]), and then again with VIDIOCSMICROCODE. Tricks involved here are: the definition of the VIDIOCSMICROCODE case in the kernel is wrong, and calling the ioctls without any preparation can trigger other kernel work (memory faults, etc) that may destroy the stack contents. First, we need the real value for the desired case statement. This turns out to be 0x4020761b. Next, we just repeatedly call the setup ioctl in an attempt to get incidental kernel work out of the way so that our last ioctl doing the stack preparation will stick, and then we call the buggy ioctl to trigger the vulnerability. Since the ioctl already does a multi-byte copy, we can now copy arbitrary lengths of bytes into kernel memory.

One method of turning an arbitrary kernel memory write into a privilege escalation is to overwrite a kernel function pointer, and trigger that function. Based on the exploit for CVE-2010-3081, I opted to overwrite the security_ops function pointer table. Their use of msg_queue_msgctl wasn't very good for the general case since it's near the end of the table and its offset would depend on kernel versions. Initially I opted for getcap, but in the end used ptrace_traceme, both of which are very near the top of the security_ops structure. (Though I need to share credit here with Dan Rosenberg, as we were working together on improving the reliability of the security_ops overwrite method. He used the same approach for his excellent RDS exploit.) Here are the steps for one way of taking an arbitrary kernel memory write and turning it into a root escalation. Here's the source for Vyakarana as seen running in Enlightenment using cap_getcap (which is pretty unstable, so you might want to switch it to use ptrace_traceme), and as a stand-alone memory writer.

Conclusions: keep auditing the kernel for more arbitrary writes; I think there are still many left. And reduce the exploitation surface within the kernel itself (which PaX and grsecurity have been doing for a while now), specifically:

15 October 2010

Enrico Zini: Award winning code

Yuwei and I had a fun day at hhhmcr (#hhhmcr) and even managed to put together a prototype that won the first prize \o/ We played with the gmp24 dataset kindly extracted from Twitter by Michael Brunton-Spall of the Guardian into a convenient JSON dataset. The idea was to find ways of making it easier to look at the data and make sense of it. This is the story of what we did, including the code we wrote. The original dataset has several JSON files, so the first task was to put them all together:
#!/usr/bin/python
# Merge the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)
import simplejson
import os
res = []
for f in os.listdir("."):
    if not f.startswith("gmp24"): continue
    data = open(f).read().strip()
    if data == "[]": continue
    parsed = simplejson.loads(data)
    res.extend(parsed)
print simplejson.dumps(res)
The results however were not ordered by date, as GMP had to use several accounts to tweet because Twitter was putting Greater Manchester Police into jail for generating too much traffic. There would be quite a bit to write about that, but let's stick to our work. Here is code to sort the JSON data by time:
#!/usr/bin/python
# Sort the JSON data
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)
import simplejson
import sys
import datetime as dt
all_recs = simplejson.load(sys.stdin)
all_recs.sort(key=lambda x: dt.datetime.strptime(x["created_at"], "%a %b %d %H:%M:%S +0000 %Y"))
simplejson.dump(all_recs, sys.stdout)
I then wanted to play with Tf-idf for extracting the most important words of every tweet:
#!/usr/bin/python
# tfifd - Annotate JSON elements with Tf-idf extracted keywords
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import sys, math
import simplejson
import re
# Read all the twits
records = simplejson.load(sys.stdin)
# All the twits by ID
byid = dict(((x["id"], x) for x in records))
# Stopwords we ignore
stopwords = set(["by", "it", "and", "of", "in", "a", "to"])
# Tokenising engine
re_num = re.compile(r"^\d+$")
re_word = re.compile(r"(\w+)")
def tokenise(tweet):
    "Extract tokens from a tweet"
    for tok in tweet["text"].split():
        tok = tok.strip().lower()
        if re_num.match(tok): continue
        mo = re_word.match(tok)
        if not mo: continue
        if mo.group(1) in stopwords: continue
        yield mo.group(1)
# Extract tokens from tweets
tokenised = dict(((x["id"], list(tokenise(x))) for x in records))
# Aggregate token counts
aggregated = {}
for d in byid.iterkeys():
    for t in tokenised[d]:
        if t in aggregated:
            aggregated[t] += 1
        else:
            aggregated[t] = 1
def tfidf(doc, tok):
    "Compute TFIDF score of a token in a document"
    return doc.count(tok) * math.log(float(len(byid)) / aggregated[tok])
# Annotate tweets with keywords
res = []
for name, tweet in byid.iteritems():
    doc = tokenised[name]
    keywords = sorted(set(doc), key=lambda tok: tfidf(doc, tok), reverse=True)[:5]
    tweet["keywords"] = keywords
    res.append(tweet)
simplejson.dump(res, sys.stdout)
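For reference, the score the tfidf function above computes is the classic term frequency times inverse document frequency:

    tfidf(t, d) = count(t, d) * log(N / n_t)

where N is the total number of tweets and n_t is, in this quick-and-dirty version, the total number of occurrences of token t across all tweets (a rough stand-in for the usual document frequency, which would count the tweets containing t).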
I thought this was producing a nice summary of every tweet but nobody was particularly interested, so we moved on to adding categories to tweet. Thanks to Yuwei who put together some useful keyword sets, we managed to annotate each tweet with a place name (i.e. "Stockport"), a social place name (i.e. "pub", "bank") and a social category (i.e. "man", "woman", "landlord"...) The code is simple; the biggest work in it was the dictionary of keywords:
#!/usr/bin/python
# categorise - Annotate JSON elements with categories
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
# Copyright (C) 2010  Yuwei Lin <yuwei@ylin.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import sys, math
import simplejson
import re
# Electoral wards from http://en.wikipedia.org/wiki/List_of_electoral_wards_in_Greater_Manchester
placenames = ["Altrincham", "Sale West",
"Altrincham", "Ashton upon Mersey", "Bowdon", "Broadheath", "Hale Barns", "Hale Central", "St Mary", "Timperley", "Village",
"Ashton-under-Lyne",
"Ashton Hurst", "Ashton St Michael", "Ashton Waterloo", "Droylsden East", "Droylsden West", "Failsworth East", "Failsworth West", "St Peter",
"Blackley", "Broughton",
"Broughton", "Charlestown", "Cheetham", "Crumpsall", "Harpurhey", "Higher Blackley", "Kersal",
"Bolton North East",
"Astley Bridge", "Bradshaw", "Breightmet", "Bromley Cross", "Crompton", "Halliwell", "Tonge with the Haulgh",
"Bolton South East",
"Farnworth", "Great Lever", "Harper Green", "Hulton", "Kearsley", "Little Lever", "Darcy Lever", "Rumworth",
"Bolton West",
"Atherton", "Heaton", "Lostock", "Horwich", "Blackrod", "Horwich North East", "Smithills", "Westhoughton North", "Chew Moor", "Westhoughton South",
"Bury North",
"Church", "East", "Elton", "Moorside", "North Manor", "Ramsbottom", "Redvales", "Tottington",
"Bury South",
"Besses", "Holyrood", "Pilkington Park", "Radcliffe East", "Radcliffe North", "Radcliffe West", "St Mary", "Sedgley", "Unsworth",
"Cheadle",
"Bramhall North", "Bramhall South", "Cheadle", "Gatley", "Cheadle Hulme North", "Cheadle Hulme South", "Heald Green", "Stepping Hill",
"Denton", "Reddish",
"Audenshaw", "Denton North East", "Denton South", "Denton West", "Dukinfield", "Reddish North", "Reddish South",
"Hazel Grove",
"Bredbury", "Woodley", "Bredbury Green", "Romiley", "Hazel Grove", "Marple North", "Marple South", "Offerton",
"Heywood", "Middleton",
"Bamford", "Castleton", "East Middleton", "Hopwood Hall", "Norden", "North Heywood", "North Middleton", "South Middleton", "West Heywood", "West Middleton",
"Leigh",
"Astley Mosley Common", "Atherleigh", "Golborne", "Lowton West", "Leigh East", "Leigh South", "Leigh West", "Lowton East", "Tyldesley",
"Makerfield",
"Abram", "Ashton", "Bryn", "Hindley", "Hindley Green", "Orrell", "Winstanley", "Worsley Mesnes",
"Manchester Central",
"Ancoats", "Clayton", "Ardwick", "Bradford", "City Centre", "Hulme", "Miles Platting", "Newton Heath", "Moss Side", "Moston",
"Manchester", "Gorton",
"Fallowfield", "Gorton North", "Gorton South", "Levenshulme", "Longsight", "Rusholme", "Whalley Range",
"Manchester", "Withington",
"Burnage", "Chorlton", "Chorlton Park", "Didsbury East", "Didsbury West", "Old Moat", "Withington",
"Oldham East", "Saddleworth",
"Alexandra", "Crompton", "Saddleworth North", "Saddleworth South", "Saddleworth West", "Lees", "St James", "St Mary", "Shaw", "Waterhead",
"Oldham West", "Royton",
"Chadderton Central", "Chadderton North", "Chadderton South", "Coldhurst", "Hollinwood", "Medlock Vale", "Royton North", "Royton South", "Werneth",
"Rochdale",
"Balderstone", "Kirkholt", "Central Rochdale", "Healey", "Kingsway", "Littleborough Lakeside", "Milkstone", "Deeplish", "Milnrow", "Newhey", "Smallbridge", "Firgrove", "Spotland", "Falinge", "Wardle", "West Littleborough",
"Salford", "Eccles",
"Claremont", "Eccles", "Irwell Riverside", "Langworthy", "Ordsall", "Pendlebury", "Swinton North", "Swinton South", "Weaste", "Seedley",
"Stalybridge", "Hyde",
"Dukinfield Stalybridge", "Hyde Godley", "Hyde Newton", "Hyde Werneth", "Longdendale", "Mossley", "Stalybridge North", "Stalybridge South",
"Stockport",
"Brinnington", "Central", "Davenport", "Cale Green", "Edgeley", "Cheadle Heath", "Heatons North", "Heatons South", "Manor",
"Stretford", "Urmston",
"Bucklow-St Martins", "Clifford", "Davyhulme East", "Davyhulme West", "Flixton", "Gorse Hill", "Longford", "Stretford", "Urmston",
"Wigan",
"Aspull New Springs Whelley", "Douglas", "Ince", "Pemberton", "Shevington with Lower Ground", "Standish with Langtree", "Wigan Central", "Wigan West",
"Worsley", "Eccles South",
"Barton", "Boothstown", "Ellenbrook", "Cadishead", "Irlam", "Little Hulton", "Walkden North", "Walkden South", "Winton", "Worsley",
"Wythenshawe", "Sale East",
"Baguley", "Brooklands", "Northenden", "Priory", "Sale Moor", "Sharston", "Woodhouse Park"]
# Manual coding from Yuwei
placenames.extend(["City centre", "Tameside", "Oldham", "Bury", "Bolton",
"Trafford", "Pendleton", "New Moston", "Denton", "Eccles", "Leigh", "Benchill",
"Prestwich", "Sale", "Kearsley", ])
placenames.extend(["Trafford", "Bolton", "Stockport", "Levenshulme", "Gorton",
"Tameside", "Blackley", "City centre", "Airport", "South Manchester",
"Rochdale", "Chorlton", "Uppermill", "Castleton", "Stalybridge", "Ashton",
"Chadderton", "Bury", "Ancoats", "Whalley Range", "West Yorkshire",
"Fallowfield", "New Moston", "Denton", "Stretford", "Eccles", "Pendleton",
"Leigh", "Altrincham", "Sale", "Prestwich", "Kearsley", "Hulme", "Withington",
"Moss Side", "Milnrow", "outskirt of Manchester City Centre", "Newton Heath",
"Wythenshawe", "Mancunian Way", "M60", "A6", "Droylesden", "M56", "Timperley",
"Higher Ince", "Clayton", "Higher Blackley", "Lowton", "Droylsden",
"Partington", "Cheetham Hill", "Benchill", "Longsight", "Didsbury",
"Westhoughton"])
# Social categories from Yuwei
soccat = ["man", "woman", "men", "women", "youth", "teenager", "elderly",
"patient", "taxi driver", "neighbour", "male", "tenant", "landlord", "child",
"children", "immigrant", "female", "workmen", "boy", "girl", "foster parents",
"next of kin"]
for i in range(100):
    soccat.append("%d-year-old" % i)
    soccat.append("%d-years-old" % i)
# Types of social locations from Yuwei
socloc = ["car park", "park", "pub", "club", "shop", "premises", "bus stop",
"property", "credit card", "supermarket", "garden", "phone box", "theatre",
"toilet", "building site", "Crown court", "hard shoulder", "telephone kiosk",
"hotel", "restaurant", "cafe", "petrol station", "bank", "school",
"university"]
extras = {"placename": placenames, "soccat": soccat, "socloc": socloc}
# Normalise keyword lists
for k, v in extras.iteritems():
    # Remove duplicates
    v = list(set(v))
    # Sort by length, longest first, so that e.g. "woman" is tried before "man"
    v.sort(key=len, reverse=True)
    # Store the normalised list back into extras (v is a new list here)
    extras[k] = v
# Add keywords
def add_categories(tweet):
    text = tweet["text"].lower()
    for field, categories in extras.iteritems():
        for cat in categories:
            if cat.lower() in text:
                tweet[field] = cat
                break
    return tweet
# Read all the twits
records = (add_categories(x) for x in simplejson.load(sys.stdin))
simplejson.dump(list(records), sys.stdout)
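As a quick sanity check, this is what the matching does on a made-up tweet, pasting the code above into a Python prompt (dictionary ordering may differ):
>>> add_categories({"id": 1, "text": "Man arrested outside a pub in Stockport"})
{'placename': 'Stockport', 'soccat': 'man', 'socloc': 'pub', 'id': 1, 'text': 'Man arrested outside a pub in Stockport'}
Note that sorting the keyword lists longest first matters: every "woman" also contains "man" as a substring, so trying the longer keywords first avoids tagging every woman as a man.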
All these scripts form a nice processing chain: each script takes a list of JSON records, adds some bits and passes the list on. In order to see what we have so far, here is a simple script to convert the JSON twits to CSV so they can be viewed in a spreadsheet:
#!/usr/bin/python
# Convert the JSON twits to CSV
# (C) 2010 Enrico Zini <enrico@enricozini.org>
# License: WTFPL version 2 (http://sam.zoy.org/wtfpl/)
import simplejson
import sys
import csv
rows = ["id", "created_at", "text", "keywords", "placename"]
writer = csv.writer(sys.stdout)
for rec in simplejson.load(sys.stdin):
    rec["keywords"] = " ".join(rec["keywords"])
    rec["placename"] = rec.get("placename", "")
    writer.writerow([rec[row] for row in rows])
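Putting the chain together is then just shell plumbing (the file names are, again, made up for the example):
python sortdate.py < raw.json | python tfidf.py | python categorise.py | python tocsv.py > tweets.csv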
At this point we were coming up with lots of questions: "were there more reports on women or men?", "which place had most incidents?", "what were the incidents involving animals?"... Time to bring Xapian into play. This script reads all the JSON tweets and builds a Xapian index with them:
#!/usr/bin/python
# toxapian - Index JSON tweets in Xapian
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import simplejson
import sys
import os, os.path
import xapian
DBNAME = sys.argv[1]
db = xapian.WritableDatabase(DBNAME, xapian.DB_CREATE_OR_OPEN)
stemmer = xapian.Stem("english")
indexer = xapian.TermGenerator()
indexer.set_stemmer(stemmer)
indexer.set_database(db)
data = simplejson.load(sys.stdin)
for rec in data:
    doc = xapian.Document()
    doc.set_data(str(rec["id"]))
    indexer.set_document(doc)
    indexer.index_text_without_positions(rec["text"])
    # Index categories as categories
    if "placename" in rec:
        doc.add_boolean_term("XP" + rec["placename"].lower())
    if "soccat" in rec:
        doc.add_boolean_term("XS" + rec["soccat"].lower())
    if "socloc" in rec:
        doc.add_boolean_term("XL" + rec["socloc"].lower())
    db.add_document(doc)
db.flush()
# Also save the whole dataset so we know where to find it later if we want to
# show the details of an entry
simplejson.dump(data, open(os.path.join(DBNAME, "all.json"), "w"))
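Building the index is a one-liner; gmp24.db (a name made up for the example) is the directory that Xapian creates or opens:
python toxapian.py gmp24.db < tweets.json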
And this is a simple command line tool to query the database:
#!/usr/bin/python
# xgrep - Command line tool to query the GMP24 tweet Xapian database
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import simplejson
import sys
import os, os.path
import xapian
DBNAME = sys.argv[1]
db = xapian.Database(DBNAME)
stem = xapian.Stem("english")
qp = xapian.QueryParser()
qp.set_default_op(xapian.Query.OP_AND)
qp.set_database(db)
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
qp.add_boolean_prefix("place", "XP")
qp.add_boolean_prefix("soc", "XS")
qp.add_boolean_prefix("loc", "XL")
query = qp.parse_query(sys.argv[2],
    xapian.QueryParser.FLAG_BOOLEAN |
    xapian.QueryParser.FLAG_LOVEHATE |
    xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE |
    xapian.QueryParser.FLAG_WILDCARD |
    xapian.QueryParser.FLAG_PURE_NOT |
    xapian.QueryParser.FLAG_SPELLING_CORRECTION |
    xapian.QueryParser.FLAG_AUTO_SYNONYMS)
enquire = xapian.Enquire(db)
enquire.set_query(query)
count = 40
matches = enquire.get_mset(0, count)
estimated = matches.get_matches_estimated()
print "%d/%d results" % (matches.size(), estimated)
data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json"))))
for m in matches:
    rec = data[m.document.get_data()]
    print rec["text"]
print "%d/%d results" % (matches.size(), matches.get_matches_estimated())
total = db.get_doccount()
estimated = matches.get_matches_estimated()
print "%d results over %d documents, %d%%" % (estimated, total, estimated * 100 / total)
Neat! Now that we have a proper index that supports all sorts of cool things, like stemming, tag clouds, full text search with complex queries, lookup of similar documents, keyword suggestions and so on, it was only fair to put together a web service to share it with other people at the event. It helped that I had already written similar code for apt-xapian-index and dde before. Here is the server, quickly built on bottle. The very last line starts the server, and it is where you can configure the listening interface and port.
#!/usr/bin/python
# xserve - Make the GMP24 tweet Xapian database available on the web
#
# Copyright (C) 2010  Enrico Zini <enrico@enricozini.org>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
import bottle
from bottle import route, post
from cStringIO import StringIO
import cPickle as pickle
import simplejson
import sys
import os, os.path
import xapian
import urllib
import math
bottle.debug(True)
DBNAME = sys.argv[1]
QUERYLOG = os.path.join(DBNAME, "queries.txt")
data = dict((str(x["id"]), x) for x in simplejson.load(open(os.path.join(DBNAME, "all.json"))))
prefixes = {"place": "XP", "soc": "XS", "loc": "XL"}
prefix_desc = {"place": "Place name", "soc": "Social category", "loc": "Social location"}
db = xapian.Database(DBNAME)
stem = xapian.Stem("english")
qp = xapian.QueryParser()
qp.set_default_op(xapian.Query.OP_AND)
qp.set_database(db)
qp.set_stemmer(stem)
qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
for k, v in prefixes.iteritems():
    qp.add_boolean_prefix(k, v)
def make_query(qstring):
    return qp.parse_query(qstring,
        xapian.QueryParser.FLAG_BOOLEAN |
        xapian.QueryParser.FLAG_LOVEHATE |
        xapian.QueryParser.FLAG_BOOLEAN_ANY_CASE |
        xapian.QueryParser.FLAG_WILDCARD |
        xapian.QueryParser.FLAG_PURE_NOT |
        xapian.QueryParser.FLAG_SPELLING_CORRECTION |
        xapian.QueryParser.FLAG_AUTO_SYNONYMS)
@route("/")
def index():
    query = urllib.unquote_plus(bottle.request.GET.get("q", ""))
    out = StringIO()
    print >>out, '''
<html>
<head>
<title>Query</title>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js"></script>
<script type="text/javascript">
$(function() {
    $("#queryfield")[0].focus()
})
</script>
</head>
<body>
<h1>Search</h1>
<form method="POST" action="/query">
Keywords: <input type="text" name="query" value="%s" id="queryfield">
<input type="submit">
<a href="http://xapian.org/docs/queryparser.html">Help</a>
</form>''' % query
    print >>out, '''
<p>Example: "car place:wigan"</p>

<p>Available prefixes:</p>

<ul>
'''
    for pfx in prefixes.keys():
        print >>out, "<li><a href='/catinfo/%s'>%s - %s</a></li>" % (pfx, pfx, prefix_desc[pfx])
    print >>out, '''
</ul>
'''
    oldqueries = []
    if os.path.exists(QUERYLOG):
        total = db.get_doccount()
        fd = open(QUERYLOG, "r")
        while True:
            try:
                q = pickle.load(fd)
            except EOFError:
                break
            oldqueries.append(q)
        fd.close()
        def print_query(q):
            count = q["count"]
            print >>out, "<li><a href='/query?query=%s'>%s (%d/%d %.2f%%)</a></li>" % (urllib.quote_plus(q["q"]), q["q"], count, total, count * 100.0 / total)
        print >>out, "<p>Last 10 queries:</p><ul>"
        for q in oldqueries[:-10:-1]:
            print_query(q)
        print >>out, "</ul>"
        # Remove duplicates
        oldqueries = dict(((x["q"], x) for x in oldqueries)).values()
        print >>out, "<table>"
        print >>out, "<tr><th>10 queries with most results</th><th>10 queries with least results</th></tr>"
        print >>out, "<tr><td>"
        print >>out, "<ul>"
        oldqueries.sort(key=lambda x:x["count"], reverse=True)
        for q in oldqueries[:10]:
            print_query(q)
        print >>out, "</ul>"
        print >>out, "</td><td>"
        print >>out, "<ul>"
        nonempty = [x for x in oldqueries if x["count"] > 0]
        nonempty.sort(key=lambda x:x["count"])
        for q in nonempty[:10]:
            print_query(q)
        print >>out, "</ul>"
        print >>out, "</td></tr>"
        print >>out, "</table>"
    print >>out, '''
</body>
</html>'''
    return out.getvalue()
@route("/query")
@route("/query/")
@post("/query")
@post("/query/")
def query():
    query = bottle.request.POST.get("query", bottle.request.GET.get("query", ""))
    enquire = xapian.Enquire(db)
    enquire.set_query(make_query(query))
    count = 40
    matches = enquire.get_mset(0, count)
    estimated = matches.get_matches_estimated()
    total = db.get_doccount()
    out = StringIO()
    print >>out, '''
<html>
<head><title>Results</title></head>
<body>
<h1>Results for "<b>%s</b>"</h1>
''' % query
    if estimated == 0:
        print >>out, "No results found."
    else:
        # Give as results the first 30 documents; also use them as the key
        # ones to use to compute relevant terms
        rset = xapian.RSet()
        for m in enquire.get_mset(0, 30):
            rset.add_document(m.document.get_docid())
        # Compute the tag cloud
        class NonTagFilter(xapian.ExpandDecider):
            def __call__(self, term):
                return not term[0].isupper() and not term[0].isdigit()
        cloud = []
        maxscore = None
        for res in enquire.get_eset(40, rset, NonTagFilter()):
            # Normalise the score in the interval [0, 1]
            weight = math.log(res.weight)
            if maxscore is None: maxscore = weight
            tag = res.term
            cloud.append([tag, float(weight) / maxscore])
        max_weight = cloud[0][1]
        min_weight = cloud[-1][1]
        cloud.sort(key=lambda x:x[0])
        def mklink(query, term):
            return "/query?query=%s" % urllib.quote_plus(query + " and " + term)
        print >>out, "<h2>Tag cloud</h2>"
        print >>out, "<blockquote>"
        for term, weight in cloud:
            size = 100 + 100.0 * (weight - min_weight) / (max_weight - min_weight)
            print >>out, "<a href='%s' style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(query, term), size, term)
        print >>out, "</blockquote>"
        print >>out, "<h2>Results</h2>"
        print >>out, "<p><a href='/'>Search again</a></p>"
        print >>out, "<p>%d results over %d documents, %.2f%%</p>" % (estimated, total, estimated * 100.0 / total)
        print >>out, "<p>%d/%d results</p>" % (matches.size(), estimated)
        print >>out, "<ul>"
        for m in matches:
            rec = data[m.document.get_data()]
            print >>out, "<li><a href='/item/%s'>%s</a></li>" % (rec["id"], rec["text"])
        print >>out, "</ul>"
        fd = open(QUERYLOG, "a")
        qinfo = dict(q=query, count=estimated)
        pickle.dump(qinfo, fd)
        fd.close()
    print >>out, '''
<a href="/">Search again</a>

</body>
</html>'''
    return out.getvalue()
@route("/item/:id")
@route("/item/:id/")
def show(id):
    rec = data[id]
    out = StringIO()
    print >>out, '''
<html>
<head><title>Result %s</title></head>
<body>
<h1>Raw JSON record for twit %s</h1>
<pre>''' % (rec["id"], rec["id"])
    print >>out, simplejson.dumps(rec, indent=" ")
    print >>out, '''
</pre>
</body>
</html>'''
    return out.getvalue()
@route("/catinfo/:name")
@route("/catinfo/:name/")
def catinfo(name):
    prefix = prefixes[name]
    out = StringIO()
    print >>out, '''
<html>
<head><title>Values for %s</title></head>
<body>
''' % name
    terms = [(x.term[len(prefix):], db.get_termfreq(x.term)) for x in db.allterms(prefix)]
    terms.sort(key=lambda x:x[1], reverse=True)
    # terms is sorted by decreasing frequency: the first entry is the maximum
    freq_max = terms[0][1]
    freq_min = terms[-1][1]
    def mklink(name, term):
        return "/query?query=%s" % urllib.quote_plus(name + ":" + term)
    # Build tag cloud
    print >>out, "<h1>Tag cloud</h1>"
    print >>out, "<blockquote>"
    for term, freq in sorted(terms[:20], key=lambda x:x[0]):
        size = 100 + 100.0 * (freq - freq_min) / (freq_max - freq_min)
        print >>out, "<a href='%s' style='font-size:%d%%; color:brown;'>%s</a>" % (mklink(name, term), size, term)
    print >>out, "</blockquote>"
    print >>out, "<h1>All terms</h1>"
    print >>out, "<table>"
    print >>out, "<tr><th>Occurrences</th><th>Name</th></tr>"
    for term, freq in terms:
        print >>out, "<tr><td>%d</td><td><a href='/query?query=%s'>%s</a></td></tr>" % (freq, urllib.quote_plus(name + ":" + term), term)
    print >>out, "</table>"
    print >>out, '''
</body>
</html>'''
    return out.getvalue()
# Change here for bind host and port
bottle.run(host="0.0.0.0", port=8024)
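To try it locally (file names as in the examples above), run python xserve.py gmp24.db and point a browser at http://localhost:8024/.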
...and then we presented our work and ended up winning the contest. This was the story of how we wrote this set of award-winning code.
